Leveraging machine learning to distinguish between bacterial and viral induced pharyngitis using hematological markers: a retrospective cohort study

Accurate differentiation between bacterial and viral pharyngitis is essential for personalized treatment and judicious antibiotic use. From a cohort of 693 patients with pharyngitis, data from 197 individuals clearly diagnosed with bacterial or viral infections were analyzed in this study. By integrating detailed hematological data with several machine learning algorithms, including Random Forest, Neural Networks, Decision Trees, Support Vector Machine, Naive Bayes, and Lasso Regression, potential biomarkers were identified, with emphasis placed on the diagnostic significance of the Monocyte-to-Lymphocyte Ratio. Distinct inflammatory signatures associated with bacterial infections were highlighted. An innovation introduced in this research was the adaptation of the high-accuracy Lasso Regression model for the TI-84 calculator, which achieved an AUC (95% CI) of 0.94 (0.925–0.955). Using this adaptation, key laboratory parameters can be entered on the spot and infection probabilities computed immediately. This methodology represents an improvement in diagnostics, facilitating more effective distinction between bacterial and viral infections while fostering judicious antibiotic use.


Patients' recruitment
The patients with pharyngitis were enrolled in the study through a systematic recruitment process that aimed to ensure the inclusion of eligible participants with complete and relevant data. The recruitment process followed several steps to identify and select suitable candidates:
1. Patient Identification: Potential participants with symptoms of pharyngitis were identified from the patient population attending the Department of Otorhinolaryngology at Hebei Provincial Hospital of Traditional Chinese Medicine, between 2019 and 2023.
2. Screening for patients with pharyngitis: This screening involved a review of their medical history and a clinical examination.
3. Inclusion Criteria: To be included in the study, patients had to meet the following criteria:

Independent variables
Several parameters were assessed to provide a comprehensive picture, including basic demographic information, the complete blood count, and novel parameters such as the neutrophil-to-lymphocyte ratio (NLR), platelet-to-lymphocyte ratio (PLR), monocyte-to-lymphocyte ratio (MLR), and mean platelet volume-to-lymphocyte ratio (MPVLR). These novel parameters were log-transformed prior to analysis to manage skewness, stabilize variance, lessen the influence of outliers, and convert multiplicative relationships into more interpretable additive ones, enhancing the robustness and validity of our statistical tests.
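The derived ratios described above can be illustrated with a minimal sketch. It assumes a natural logarithm (the base is not stated in the text) and uses a hypothetical function name and argument order; it is not the authors' code.

```python
import math


def log_ratios(neu, mono, lym, plt, mpv):
    """Return the log-transformed NLR, PLR, MLR and MPVLR.

    Assumption: natural log; the source does not state the base.
    All inputs must be strictly positive for the logarithm to exist.
    """
    if min(neu, mono, lym, plt, mpv) <= 0:
        raise ValueError("all inputs must be positive for a log transform")
    return {
        "logNLR": math.log(neu / lym),    # neutrophil-to-lymphocyte
        "logPLR": math.log(plt / lym),    # platelet-to-lymphocyte
        "logMLR": math.log(mono / lym),   # monocyte-to-lymphocyte
        "logMPVLR": math.log(mpv / lym),  # mean platelet volume-to-lymphocyte
    }


# Illustrative values only, not patient data.
example = log_ratios(neu=4.0, mono=0.5, lym=2.0, plt=250.0, mpv=10.0)
```

Because log(a / b) = log(a) − log(b), each transformed ratio is the difference of two log-scale measurements, which is the sense in which the transform turns a multiplicative relationship into an additive one.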

Model deployment
The Lasso regression model was implemented on a TI-84 calculator via a custom script designed for rapid input of laboratory parameters. Upon input, the script returned the predicted probability of the infection type. The model's performance was evaluated and validated using our designated validation cohort.
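As a rough illustration of what such a script computes, the sketch below assumes the Lasso model is an L1-penalized logistic regression, so that the probability is a logistic function of a weighted sum of the inputs. The intercept, the coefficients, and the choice of three features are hypothetical placeholders, since the fitted values are not reported in the text; the calculator program itself is not reproduced here.

```python
import math

# Hypothetical placeholders, NOT the fitted model from the study.
INTERCEPT = -4.0
COEFFICIENTS = {"NEUp": 0.08, "LYMp": -0.05, "MPV": 0.10}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bacterial_probability(features, intercept=INTERCEPT, coefs=COEFFICIENTS):
    """Probability of bacterial infection from a dict of laboratory values."""
    # Linear predictor: intercept plus the coefficient-weighted inputs.
    z = intercept + sum(coefs[name] * features[name] for name in coefs)
    return sigmoid(z)


# Illustrative input only, not patient data.
p = bacterial_probability({"NEUp": 80.0, "LYMp": 30.0, "MPV": 10.0})
```

A model of this form needs only multiplication, addition, and one exponential, which is why it is a plausible fit for a handheld calculator.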

Ethical compliance
This study was approved by the Medical Ethical Committee of Hebei Provincial Hospital of Traditional Chinese Medicine (registration number HBZY2023-KY-012-01). Because of the retrospective nature of the study, the same committee granted a waiver of the requirement for informed consent. The study adhered strictly to the ethical guidelines for research involving human subjects, and patient privacy and confidentiality were upheld throughout.

Results
A total of 693 patients diagnosed with pharyngitis were initially identified. After the predefined inclusion and exclusion criteria were applied, a cohort of 197 eligible patients was defined. This cohort included 74 individuals diagnosed with viral infections and 123 with bacterial infections, as depicted in Fig. 1. These participants were then allocated into two primary cohorts for further analysis: the training cohort, consisting of 55 individuals with bacterial infections and 56 with viral infections, and the validation cohort, comprising 19 individuals with bacterial infections and 18 with viral infections. This allocation provided a structured framework for the comparative analysis of viral and bacterial pharyngitis cases, thereby facilitating the subsequent development and validation of machine learning models.
Demographic attributes such as sex, age, and clinical status (outpatient or inpatient) were recorded. An approximately balanced distribution of males and females was observed across both cohorts, aligning with the demographic findings in related literature 21. Age ranged from 18 to 85 years, with the most represented age group being 18-34 years. The majority of the patients were outpatients, with no significant difference in distribution between the two infection types (Table 1).
Outcome measures focused on hematological indices. The observed variations included higher Red Blood Cell (RBC) count and Hemoglobin (HGB) concentration in patients with viral infections (p = 0.003 and p = 0.006, respectively) and elevated White Blood Cell (WBC), Neutrophil (NEU), and Monocyte (MONO) counts in patients with bacterial infections (p < 0.01 for each parameter). Significant differences between the two infection groups were also noted for other parameters, namely Percentage of Neutrophils (NEUp), Lymphocyte (LYM) count, Percentage of Lymphocytes (LYMp), log-transformed Monocyte to Lymphocyte ratio (logMLR), log-transformed Neutrophil to Lymphocyte ratio (logNLR), and log-transformed Platelet to Lymphocyte ratio (logPLR) (Table 1).
A comparative analysis revealed significant differences in several hematological indices between the viral and bacterial infection groups in both the training and validation cohorts. Notably, in the training cohort, there were significant differences in HGB, WBC, NEU, NEUp, LYM, LYMp, logMLR, logNLR, logPLR, and log-transformed Mean Platelet Volume to Lymphocyte ratio (logMPVLR) (all p < 0.05). Meanwhile, the validation cohort displayed significant differences for NEU, NEUp, LYMp, logMLR, and logNLR (all p < 0.05) (Table 2). These findings echo the inherent diagnostic challenges associated with pharyngitis, where overlapping symptoms between bacterial pharyngitis, primarily caused by Group A β-hemolytic streptococcus, and viral pharyngitis often complicate accurate diagnosis 22. Although blood tests have been instrumental in aiding the diagnosis of acute viral and bacterial infections, their efficacy is sometimes hindered by their inability to capture the evolving inflammatory response after symptom onset 23.
The violin plots demonstrated distinct trends in hematological and inflammatory parameters between bacterial and viral infections (Fig. 2). WBC and MONO counts exhibited overlapping distributions, hinting at a common inflammatory response irrespective of the infection type. Conversely, NEU, NEUp, logMLR, logNLR, and logPLR were predominantly higher in bacterial infections, while LYM and LYMp leaned more towards viral infections.
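The text does not state how patients were assigned to the two cohorts. As one common approach, the sketch below performs a seeded, class-stratified split so that each infection type is represented in both sets; the function name, the 25% validation fraction, and the seed are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Split sample indices into training and validation sets, class by class.

    Assumption: stratified random allocation; the study's actual
    procedure is not described in the text.
    """
    rng = random.Random(seed)  # local generator keeps the split reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * validation_fraction)
        validation.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(validation)


# Toy labels only, not the study data.
toy_labels = ["bacterial"] * 8 + ["viral"] * 4
train_idx, val_idx = stratified_split(toy_labels)
```

Stratifying keeps the bacterial-to-viral proportion similar in both sets, which matters for small cohorts where a purely random split can leave one class under-represented in validation.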
Machine learning methods were applied to the dataset to assess their predictive performance. Performance metrics for both the training and validation cohorts, including accuracy, precision, recall, F1-score, and the AUC, were computed and are displayed in Tables 3 and 4. The Random Forest notably exhibited superior performance in terms of accuracy and AUC in both cohorts, aligning with findings from past studies on bacterial and viral infections 24. Meanwhile, ROC curves for the Lasso Regression and SVM models suggested a high degree of accuracy in infection type classification. During cross-validation on the training set, the optimized Lasso Regression model attained an AUC score of approximately 0.90. The model's robustness and generalizability were supported by its performance on a separate validation set, where it achieved an AUC score of approximately 0.94, demonstrating its ability to effectively distinguish between viral and bacterial infections (Fig. 3).
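For readers unfamiliar with these metrics, the following sketch shows how they can be computed from labels, predicted classes, and predicted scores. It is a generic illustration, assuming bacterial infection is coded as 1 and viral as 0, and it is not the evaluation code used in the study; the AUC is computed through its rank interpretation (the probability that a random positive case scores higher than a random negative case).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example only: scores thresholded at 0.5 to obtain predicted classes.
toy_true = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]
toy_metrics = classification_metrics(toy_true, toy_pred)
toy_auc = roc_auc(toy_true, toy_scores)
```

Unlike accuracy, precision, recall, and F1, the AUC does not depend on a classification threshold, which is one reason it is commonly reported alongside them.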
The feature importance of each variable was evaluated across the different machine learning models, revealing NEUp, logMLR, and logPLR to be the most significant. The highest importance scores in the Lasso Regression model were found for NEUp (2.0110), logMPVLR (1.0451), and logPLR (0.6210). In the Decision Tree model, a high importance score was assigned to the NEUp variable (0.5127). Notably, the Random Forest model showed elevated scores for NEUp (0.3024) and LYMp (0.1349). The SVM model indicated WBC (0.0528) and NEU (0.0722) as most important, while in the Neural Network model, logMPVLR (0.1306) had the highest score. In the Naive Bayes model, the WBC variable scored slightly higher (0.0556). These scores underscore the potential utility of these variables in diagnostic algorithms (Table 5).
Following deployment, the Lasso Regression model differentiated effectively between bacterial and viral infections. By simply inputting the selected laboratory parameters into the TI-84 calculator, healthcare professionals could quickly generate infection probability outcomes (Fig. 4). The model was assessed on the validation cohort, which included data from 37 patients (19 bacterial, 18 viral infections). Its consistent and effective performance supports the model's robustness and reliability.

Discussion
This study has highlighted hematological disparities between bacterial and viral infections, shedding light on the pronounced inflammatory response elicited predominantly by bacterial infections. The hematological parameters MLR, NLR, PLR, and MPVLR have been emphasized as notable biomarkers 25,26. In line with established literature, viral infections are typically characterized by augmented RBC counts and HGB levels, while bacterial infections are more likely to display heightened WBC, NEU, and MONO counts 27 (Table 1). The distinct variations in key hematological parameters such as NEU, NEUp, LYMp, logMLR, and logNLR underscore the differential immunological responses between bacterial and viral infections (Table 2). It is well-established that neutrophils are the primary leukocytes engaged in immune responses during bacterial infections, while lymphocyte-mediated immune responses are predominantly observed during viral infections 28. The substantial feature importance score of logMLR across various machine learning models accentuates its role as a distinguishing hematological parameter, potentially aiding diagnostic differentiation between bacterial and viral infections in our study. Nevertheless, the diagnostic difficulties stemming from the overlapping distributions of WBC and NEU, as illustrated in Fig. 2, emphasize the need for a broader diagnostic strategy that does not rely on single markers 29.
From a computational viewpoint, the Random Forest emerged as the most proficient predictor for classifying infection types, with Neural Networks performing nearly as well 30. Conversely, SVM and Naive Bayes showed more variable performance, underscoring the need for careful model selection tailored to specific data characteristics 31 (Tables 3 and 4). Both Lasso Regression and Random Forest were proficient in differentiating bacterial from viral infections (Fig. 3).
In this study, Lasso Regression was utilized to create a diagnostic model for classifying infection types. The choice of Lasso Regression was predicated on its combination of variable selection and regularization 32. This makes it particularly suitable for this type of problem. Although more complex machine learning methodologies are available, Lasso Regression strikes a balance between model complexity and interpretability 33. This balance is essential in clinical environments where elucidating the relationship between predictors and outcomes is as vital as achieving prediction accuracy 34,35.
A noteworthy innovation of this work is the successful integration of the Lasso model with a widely accessible computational tool, the TI-84 calculator. Although both Random Forest and Lasso Regression exhibited commendable performance in our analysis, the computational parsimony of Lasso Regression rendered it a more pragmatic choice for the TI-84 calculator, known for its ease of use, cost-effectiveness, and ubiquitous availability. This integration facilitated the rapid and efficient input of pivotal laboratory parameters including NEU, MONO, NEUp, LYM, LYMp, PLT, and MPV, consequently generating the probability of infection type. By merely inputting the selected laboratory parameters into the calculator, healthcare professionals could promptly ascertain the likelihood of bacterial or viral infections (Fig. 4). No commercial reagents or specific equipment were required in this methodology, promoting its cost-effectiveness and widespread accessibility.
Table 1. Comparative analysis of patient characteristics and hematological indices between patients with viral and bacterial induced pharyngitis. Lists the demographic and hematological parameters studied. Data are presented as n (%) for categorical variables, median (Interquartile Range, IQR) for non-normally distributed continuous variables, and mean ± standard deviation (sd) for normally distributed continuous variables. The standards of error analysis and ranges have been accounted for in the provided IQR and sd.
Integral to this discussion is the concept of antibiotic stewardship. Given the emerging global challenge of antibiotic resistance, it is imperative to differentiate bacterial from viral infections to ensure judicious antibiotic use 36. These findings contribute to antibiotic stewardship efforts by pinpointing potential biomarkers that might expedite accurate diagnosis, thereby minimizing unwarranted antibiotic prescriptions. Emphasizing the need for precise diagnosis and targeted therapies, this study underlines the importance of combining clinical, laboratory, and computational tools in the era of personalized medicine and antibiotic stewardship 37.
Combining optimization algorithms with machine learning methodologies offers a promising route to greater diagnostic precision. Optimization algorithms such as the refined Grey Wolf Optimizer (LGWO) 38, Hunger Games Search (HGS) 39, Shrimp and Goby Association Search algorithm (SGA) 40, Planet Optimization Algorithm (P.O.A.) 41, and Runge Kutta optimizer (RUN) 42 have the potential to significantly enhance model efficacy. Although the current study did not explore these optimization techniques, incorporating such algorithms to refine the machine learning models used in this study is a significant direction we plan to pursue.
Table 2. Hematological parameters in bacterial and viral infections in training and validation cohorts. This table summarizes the hematological parameters and their logarithmically transformed ratios for both bacterial and viral infections in the training and validation cohorts. Variables include red blood cell count (RBC), hemoglobin (HGB), white blood cell count (WBC), neutrophils (NEU), monocytes (MONO), lymphocytes (LYM), platelets (PLT), mean platelet volume (MPV), and the log-transformed monocyte-to-lymphocyte ratio (logMLR), neutrophil-to-lymphocyte ratio (logNLR), platelet-to-lymphocyte ratio (logPLR), and mean platelet volume-to-lymphocyte ratio (logMPVLR). Significance levels (p-values) are reported for each variable, with < 0.05 considered significant.

Limitations
This study has several limitations. The dataset, while comprehensive, represents a specific patient population with unique characteristics that may influence the performance of the machine learning models. Potential confounding variables, including underlying health conditions and medication usage, were not rigorously controlled and may have affected the outcomes. The generalizability of the findings may therefore be contingent on the specific patient population from which the dataset was derived. Despite these considerations, the insights derived from this study are valuable, laying a groundwork for more exhaustive future investigations.

Conclusion
This study underscores the clinical necessity of accurately and swiftly distinguishing between bacterial and viral pharyngitis. By integrating traditional laboratory techniques with advanced machine learning, a new dimension to the diagnostic potential of hematological markers such as MLR was explored. The notable efficacy of the Random Forest and Lasso Regression in data prediction for this specific dataset suggests that exploring various machine learning techniques could hold promise for further diagnostic advancements. The adaptation of a Lasso Regression model for use on a TI-84 calculator showcased a practical application of machine learning in clinical settings, enhancing accessibility and ease of use compared to traditional nomograms. These findings illuminate hematological distinctions between viral and bacterial infections in adult patients with pharyngitis, offering MLR as a potential addition to diagnostic methodologies. This has the potential not only to enhance diagnostic accuracy but also to refine therapeutic interventions.
It would be beneficial to extend the application of this model to other types of infections, and to integrate more variables and machine learning techniques, thereby enhancing its utility in infectious disease diagnosis. The results from this study mark a step towards more precise and timely diagnosis of pharyngitis, contributing to better management and treatment of this common condition.

3.1. Confirmed diagnosis of pharyngitis based on clinical evaluation.
3.2. Absence of severe medical conditions or comorbidities that could confound the analysis.
4. Exclusion Criteria: Patients with the following characteristics were excluded from the study:
4.1. Absence of essential demographic details or incomplete data pertaining to complete blood count metrics.
4.2. Age below 18 years.
4.3. Patients without a definitive diagnosis of infection type.

Figure 1. Flowchart of the study design and patient categorization. A comprehensive flowchart illustrating the data collection and selection process is provided.

Figure 2. Distribution of Hematological and Inflammatory Parameters Amid Bacterial and Viral Infections. The violin plots showcase the distribution of several hematological and inflammatory parameters including 'WBC' (white blood cell count), 'NEU' (neutrophils), 'NEUp' (neutrophil percentage), 'MONO' (monocytes), 'LYM' (lymphocytes), 'LYMp' (lymphocyte percentage), 'logMLR' (log-transformed monocyte-to-lymphocyte ratio), 'logNLR' (log-transformed neutrophil-to-lymphocyte ratio), and 'logPLR' (log-transformed platelet-to-lymphocyte ratio) in cases of bacterial and viral infections. Each violin depicts the density distribution of the data, with the width indicating data density. The white dot represents the median, the thick bar illustrates the interquartile range, and the thin line encompasses the remaining data distribution, excluding outliers determined by a function of the interquartile range. These plots elucidate distinct trends between the two infection types.

Figure 4. Screenshot of the Lasso Regression Model Program on a TI-84 Calculator. This figure presents a screenshot of the TI-84 calculator running the developed Lasso regression model program. The program enables the user to input six laboratory parameters: Monocytes (MONO), Neutrophils percentage (NEUp), Lymphocytes (LYM), Lymphocytes percentage (LYMp), Platelets (PLT), and Mean Platelet Volume (MPV). The calculator subsequently generates the probability of the infection type, aiding in distinguishing between bacterial and viral infections.

Table 3. Performance metrics of machine learning models on the training cohort. AUC area under the curve, CI confidence interval, LR lasso regression, DT decision trees, RF random forest, SVM support vector machine, NN neural networks, NB naive bayes.

Table 4. Performance metrics of machine learning models on the validation cohort. AUC area under the curve, CI confidence interval, LR lasso regression, DT decision trees, RF random forest, SVM support vector machine, NN neural networks, NB naive bayes.

Figure 3. Comparative Analysis of ROC Curves from Multiple Machine Learning Models. ROC curves for six different machine learning models: Lasso Regression, Decision Tree, Random Forest, Support Vector Machine, Neural Network, and Naive Bayes. The area under the curve (AUC) metric was used to evaluate the performance of each model, with a higher AUC indicating better performance.

Table 5. Feature importance across multiple machine learning models in differentiating bacterial and viral infections. LR lasso regression, DT decision trees, RF random forest, SVM support vector machine, NN neural networks, NB naive bayes. These values represent how much each feature contributes to the model's predictions. The larger the value, the more important the feature is.